FRI Research Discussion 2026-01-16
Intersubjective scoring
Cognitive tasks
We typically assess forecasting skill by examining the squared distance between a forecaster's forecasts and some scoring criterion over many items:
\[ X_n = \frac{1}{K} \sum_{k=1}^K (F_{nk} - C_k)^2 \]
In typical testing, \(C_k\) would be fixed and known
In forecasting, item outcome not known at time of testing
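As a concrete illustration of the score itself, a minimal sketch in Python (function and variable names are ours, not from the study):

```python
import numpy as np

def msd_score(forecasts, criterion):
    """Mean squared distance between one forecaster's K forecasts
    and the K scoring-criterion values (lower is better)."""
    forecasts = np.asarray(forecasts, dtype=float)
    criterion = np.asarray(criterion, dtype=float)
    return np.mean((forecasts - criterion) ** 2)

# Hypothetical example with K = 3 items:
print(msd_score([0.7, 0.2, 0.9], [1.0, 0.0, 1.0]))  # ~0.047
```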
“Correct answer” more like a random variable
E.g., for a hypothetical item \(k\) whose outcome is normally distributed with unknown mean and standard deviation, we can represent this as \[ O_k \sim \mathcal{N}(\mu_k, \sigma_k) \]
An optimal point forecast for this item, \(F_k\), would be \(\mu_k\), so we would ideally score \(F_k\) by comparing it to \(\mu_k\):
\[ X_n^{\mu_k} = \frac{1}{K} \sum_{k=1}^K (F_{nk} - \mu_k)^2 \]
But, since \(\mu_k\) is rarely known, we typically use the outcome \(O_k\) as the scoring criterion instead.
\[X_n^{O_k} = \frac{1}{K} \sum_{k=1}^K (F_{nk} - O_k)^2 \text{ where } O_k \sim \mathcal{N}(\mu_k, \sigma_k)\]
The measurement error is clear if we rewrite this with \(O_k = \mu_k + \epsilon_{O_k}\):
\[ X_n^{O_k} = \frac{1}{K} \sum_{k=1}^K (F_{nk} - (\mu_k + \epsilon_{O_k}))^2 \]
Ideally, we would want to minimize \(\epsilon_{O_k}\).
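Taking expectations over the outcome noise makes the cost of this substitution explicit (a one-line derivation, assuming \(\epsilon_{O_k}\) has mean zero, variance \(\sigma_k^2\), and is independent of the forecasts):
\[ \mathbb{E}\left[X_n^{O_k}\right] = \frac{1}{K} \sum_{k=1}^K \left[ (F_{nk} - \mu_k)^2 + \sigma_k^2 \right] = X_n^{\mu_k} + \frac{1}{K} \sum_{k=1}^K \sigma_k^2 \]
The outcome-based score is the ideal score plus a noise term outside the forecaster's control.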
Using peer comparisons to score forecasts instead of using outcomes
Why would this work?
Wisdom of the crowds: when many predictions are aggregated, individual errors tend to cancel out. Aggregate predictions tend to be more accurate than the average prediction within a crowd.
(Galton, 1907)
The logic of the wisdom of crowds assumes that the distribution of forecasts for an item has the same mean as the outcome distribution: \(\mu_{F_k} = \mu_{k}\).
\[ F_k \sim \mathcal{N} (\mu_k, \sigma_{F_k}) \]
Under the Central Limit Theorem, the average of many forecasts will converge on the mean of the outcome distribution, with error that shrinks with the number of forecasters \(N\):
\[ A_k \sim \mathcal{N} (\mu_k, \frac{\sigma_{F_k}}{\sqrt{N}}) \]
Substituting the crowd aggregate \(A_k\) for the scoring criterion,
\[X_n^{A_k} = \frac{1}{K} \sum_{k=1}^K (F_{nk} - A_k)^2 \text{ where } A_k \sim \mathcal{N} (\mu_k, \frac{\sigma_{F_k}}{\sqrt{N}}) \]
\[ X_n^{A_k} = \frac{1}{K} \sum_{k=1}^K (F_{nk} - (\mu_k + \epsilon_{A_k}))^2 \]
As long as a crowd is unbiased (\(\mu_{F_k}=\mu_k\)) and sufficiently large \(\left(\frac{\sigma_{F_k}}{\sqrt N}<\sigma_k\right)\), on average \(\epsilon_{A_k}^2<\epsilon_{O_k}^2\) and the wisdom of that crowd \(A_k\) will be a better representation of the optimal point forecast \(\mu_k\) than \(O_k\).
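A quick simulation makes the comparison concrete (a minimal sketch with made-up parameters, not values from the study):

```python
import numpy as np

rng = np.random.default_rng(0)
mu_k, sigma_k = 3.80, 0.40   # hypothetical item mean and outcome SD
sigma_F, N = 0.60, 100       # forecast spread and crowd size
trials = 100_000

# Criterion error when scoring against a single realized outcome
eps_O = rng.normal(0.0, sigma_k, trials)

# Criterion error when scoring against an unbiased crowd aggregate
eps_A = rng.normal(0.0, sigma_F, (trials, N)).mean(axis=1)

print(np.mean(eps_O**2))  # ~ sigma_k^2     = 0.16
print(np.mean(eps_A**2))  # ~ sigma_F^2 / N = 0.0036
```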
\(N\) = 1,000 forecasters
\(K\) = 1,000 items
Generated forecasts for each forecaster on each item
\(\theta_n\) defined the expected distance between an item’s expected outcome and a forecaster’s forecast
Sampled combinations of crowd size \(N\) and item count \(K\)
Scored each forecast against both the realized outcomes (ground truth) and the crowd aggregate (intersubjective)
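For intuition, here is how such a simulation might look in Python (a minimal sketch under our own generative assumptions, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 1_000, 1_000

mu = rng.normal(0.0, 1.0, K)            # item means
sigma = 1.0                             # outcome noise SD (assumed)
theta = rng.uniform(0.2, 1.5, N)        # skill: SD of each forecaster's error,
                                        # a stand-in for theta_n
F = mu + rng.normal(0.0, 1.0, (N, K)) * theta[:, None]

O = mu + rng.normal(0.0, sigma, K)      # realized outcomes
A = F.mean(axis=0)                      # unbiased crowd aggregate

score_truth = ((F - O) ** 2).mean(axis=1)   # ground-truth scores
score_inter = ((F - A) ** 2).mean(axis=1)   # intersubjective scores

# Which score tracks the true skill parameter better?
print(np.corrcoef(theta, score_truth)[0, 1])
print(np.corrcoef(theta, score_inter)[0, 1])
```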
Which scoring measure captures forecasters’ skill better?
Intersubjective scoring correlation increases with \(N\)
For \(N\) ≥ 16, intersubjective scoring captures the original skill parameter better than ground-truth scoring does
Increased variance affects ground-truth scoring but not intersubjective scoring
With increased variance, intersubjective scoring correlates with skill more strongly at lower \(N\) and \(K\) combinations than before
Types of intersubjective measures
If a forecaster correctly identified that \(\mu_k \neq \mu_{F_k}\) (i.e., bias in the crowd), proxy scoring would incorrectly penalize the forecaster for reporting their true belief
Metapredictions explicitly elicit beliefs about \(\mu_{F_k}\) but also require more effort from the forecaster
These measures were tested for their real-time scoring ability but were surprisingly successful at predicting forecasting accuracy
But they have only been explored with categorical forecasting items
| | Ground-Truth Outcome | Intersubjective Outcome |
|---|---|---|
| Categorical Item | Price of gas was between $3.50 and $4.00 | Average probability judgment about the price of gas being between $3.50 and $4.00 was 52% |
| Continuous Item | Price of gas was $3.78 | Average prediction for price of gas was $3.80 |
Measurement scale confounds reliability boost from intersubjective measures
(Himmelstein et al., 2023)
Previously identified skilled forecasters
An aggregate of these superforecasters may serve as a better reference criterion than an aggregate of the general crowd
Proxy scoring with a superforecaster crowd aggregate has been an effective method
(Karger et al., 2021; Tetlock & Gardner, 2015)
Are intersubjective measures still good predictors of forecasting accuracy in the more reliable Quantile Elicitation Format (QEF)?
Are proxy scores or metapredictions stronger predictors of forecasting accuracy?
Final wave of a longitudinal forecasting study (N = 894)
Forecasts on six items in the QEF
Metapredictions after each item: forecasters also predicted what the crowd’s average forecast would be
Additional sample of N = 42 superforecasters
(Himmelstein et al., 2025)
Scored each forecast with: ground truth, proxy scores (crowd and superforecaster aggregates), and metaprediction scores
Compared these scores to forecasting accuracy on a separate set of thirty items
Ground-truth score distributions
Superforecaster aggregate more accurate more often
Superforecaster metapredictions and proxy scores strongest
How much variance in forecaster accuracy does each score explain?
Mixed-effects model with random effects for item and person; conducted dominance analysis
| Score | Contribution | Proportion |
|---|---|---|
| Proxy Super | .119 | .225 |
| Metaprediction Super | .118 | .223 |
| Proxy Crowd | .109 | .206 |
| Metaprediction Crowd | .094 | .178 |
| Ground Truth | .089 | .168 |
| \(R_{forecaster}^2\) | .53 | 1.00 |
(Azen & Budescu, 2003; Budescu, 1993; Luo & Azen, 2013)
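For readers unfamiliar with dominance analysis, here is a minimal numpy sketch of general dominance weights (plain OLS for brevity, ignoring the random effects for item and person; not the analysis code):

```python
from itertools import combinations
import numpy as np

def r2(X, y):
    """R^2 of an OLS fit of y on the columns of X (with intercept)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return 1.0 - (y - X1 @ beta).var() / y.var()

def general_dominance(X, y):
    """Average incremental R^2 of each predictor across all subset
    sizes of the remaining predictors; the weights sum to the
    full-model R^2."""
    p = X.shape[1]
    weights = np.zeros(p)
    for j in range(p):
        others = [i for i in range(p) if i != j]
        size_means = []
        for r in range(len(others) + 1):
            incs = [
                r2(X[:, list(S) + [j]], y) - (r2(X[:, list(S)], y) if S else 0.0)
                for S in combinations(others, r)
            ]
            size_means.append(np.mean(incs))
        weights[j] = np.mean(size_means)
    return weights
```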
Proxy scores are an effective way of finding select crowds to aggregate
Intersubjective measures still effective measure of forecasting ability
Can reduce measurement error by modifying the scoring criterion
[Mostly placeholder content right now]
Goal
Current test versions, times, \(R^2\)s
Search algorithm for best items
Iterates over all items within a set of tasks and calculates the \(R^2\) of a linear model in which the current item predicts forecasting accuracy
Adds the item with the top \(R^2\) per unit of time to the “test”
Iterates over all remaining items and calculates the \(R^2\) of a linear model containing the per-task sum scores of the items already in the test plus the current candidate item
Adds the item with the top additional \(R^2\) per unit of time to the test, and repeats
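A minimal sketch of this greedy search (using item scores directly rather than per-task sum scores, for brevity; names are ours):

```python
import numpy as np

def r2(x, y):
    """R^2 of a linear model predicting y from the columns of x."""
    X = np.column_stack([np.ones(len(y)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1.0 - (y - X @ beta).var() / y.var()

def greedy_select(item_scores, item_times, accuracy, time_budget):
    """Forward selection: repeatedly add the item with the best
    incremental R^2 per unit of administration time."""
    n_items = item_scores.shape[1]
    selected, total_time = [], 0.0
    while True:
        base = r2(item_scores[:, selected], accuracy) if selected else 0.0
        best, best_gain = None, 0.0
        for j in range(n_items):
            if j in selected or total_time + item_times[j] > time_budget:
                continue
            gain = (r2(item_scores[:, selected + [j]], accuracy) - base) / item_times[j]
            if gain > best_gain:
                best, best_gain = j, gain
        if best is None:
            return selected
        selected.append(best)
        total_time += item_times[best]
```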
It seems we can decrease the time required for the cognitive tests in the FPT
Some items seem to be adding more noise than predictive ability
Atanasov, P., & Himmelstein, M. (2023). Talent spotting in crowd prediction. In M. Seifert (Ed.), Judgment in Predictive Analytics. Springer.
Atanasov, P., Rescober, P., Stone, E., Swift, S. A., Servan-Schreiber, E., Tetlock, P., Ungar, L., & Mellers, B. (2017). Distilling the Wisdom of Crowds: Prediction Markets vs. Prediction Polls. Management Science, 63(3), 691–706. https://doi.org/10.1287/mnsc.2015.2374
Galton, F. (1907). Vox Populi. Nature, 75(1949), 450–451. https://doi.org/10.1038/075450a0
Himmelstein, M., Budescu, D. V., & Ho, E. H. (2023). The wisdom of many in few: Finding individuals who are as wise as the crowd. Journal of Experimental Psychology: General, 152(5), 1223–1244. https://doi.org/10.1037/xge0001340
Himmelstein, M., Zhu, S. M., Petrov, N., Karger, E., Helmer, J., Livnat, S., Bennett, A., Hedley, P., & Tetlock, P. (2025). The Forecasting Proficiency Test: A General Use Assessment of Forecasting Ability. OSF. https://doi.org/10.31234/osf.io/a7kdx
Karger, E., Monrad, J., Mellers, B., & Tetlock, P. (2021). Reciprocal Scoring: A Method for Forecasting Unanswerable Questions. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.3954498
Tetlock, P. E., & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown.
Wilkening, T., Martinie, M., & Howe, P. D. L. (2022). Hidden Experts in the Crowd: Using Meta-Predictions to Leverage Expertise in Single-Question Prediction Problems. Management Science, 68(1), 487–508. https://doi.org/10.1287/mnsc.2020.3919
Zhu, S. M., Budescu, D. V., Petrov, N., Karger, E., & Himmelstein, M. (2024). The psychometric properties of probability and quantile forecasts. Preprint.
Questions?
Contact: jhelmer3@gatech.edu
Slides
Preprint